Fast Feature subset selection algorithm based on clustering for high dimensional data

Author

  • S. D. Potdukhe
Abstract

A feature selection algorithm is employed to remove irrelevant and redundant information from the data. Among feature subset selection algorithms, filter methods are widely used because of their generality, and they are usually a good choice when the number of features is large. In cluster analysis, graph-theoretic clustering methods are applied to features; in particular, minimum spanning tree (MST) based clustering algorithms are adopted. The Fast clustering-bAsed feature Selection algoriThm (FAST) is based on the MST method. In FAST, features are divided into clusters using graph-theoretic clustering, and then the most representative feature, the one most strongly related to the target classes, is selected from each cluster. Features in different clusters are relatively independent. FAST is evaluated on publicly available high-dimensional image, microarray, and text data sets. Traditionally, feature subset selection research has focused on searching for relevant features; the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features.

Keywords— Cluster analysis, Graph-theoretic clustering, Minimum spanning tree, Feature selection, Feature subset selection algorithm (FAST), High dimensional data, Filter method.

INTRODUCTION

Data mining is a process of analyzing data and summarizing it into useful information. To achieve successful data mining, feature selection is an essential component. In machine learning, feature selection is also known as variable selection or attribute selection. The main idea of feature selection is to choose a subset of features by eliminating irrelevant or non-predictive information. It is the process of selecting a subset of the original features according to specific criteria, and it is an important and frequently used dimension-reduction technique in data mining.
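A filter method of the kind described above scores each feature directly from the data, without running any learner. A minimal sketch, assuming plain Pearson correlation as the ranking criterion and toy data (both are illustrative assumptions; the text does not fix a specific filter criterion here):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, target):
    """Rank features by |correlation| with the target: a filter method,
    since no learning algorithm takes part in the scoring."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# toy data: f1 closely tracks the target, f2 is noise
features = {"f1": [1, 2, 3, 4, 5], "f2": [5, 1, 4, 2, 3]}
target = [1, 2, 3, 4, 6]
print(filter_rank(features, target))  # f1 ranks ahead of f2
```

Because the score depends only on the data, the same ranking can feed any downstream learner, which is what makes filter methods general and cheap when the number of features is large.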
Feature selection removes irrelevant and redundant information from the data, thereby speeding up a data mining algorithm, improving learning accuracy, and leading to better model comprehensibility. Supervised, unsupervised, and semi-supervised feature selection algorithms have been developed. A supervised feature selection algorithm determines feature relevance by evaluating each feature's correlation with the class or its utility for achieving accurate prediction; without labels, an unsupervised feature selection algorithm may exploit data variance or data distribution to evaluate feature relevance; and a semi-supervised feature selection algorithm uses a small amount of labelled data as additional information to improve unsupervised feature selection [2]. Feature subset selection methods can be divided into four major categories: embedded, wrapper, filter, and hybrid. Embedded methods perform feature selection as part of the training process and are usually specific to given learning algorithms, and are thus possibly more efficient than the other three categories; decision trees and artificial neural networks are examples of embedded approaches. Wrapper methods assess subsets of variables according to their usefulness to a given predictor, conducting a search for a good subset using the learning algorithm itself as part of the evaluation function. Filter methods are pre-processing methods: they attempt to assess useful features from the data alone, ignoring the effect of the selected feature subset on the performance of the learning algorithm; examples are methods that select variables by ranking them through compression techniques or by computing correlation with the output. Hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space that the subsequent wrapper considers.
The key idea of a hybrid method is to combine filter and wrapper methods so as to achieve the best possible performance with a particular learning algorithm at a time complexity similar to that of filter methods [1]. In cluster analysis, the graph-theoretic approach is used in many applications; in general graph-theoretic clustering, a complete graph is formed by connecting each instance with all its neighbours. Zahn's clustering algorithm proceeds as follows:

1. Construct the MST for the given set of n patterns.
2. Identify inconsistent edges in the MST.
3. Remove the inconsistent edges to form connected components, and call them clusters.

In the FAST algorithm, features are divided into clusters using graph-theoretic clustering, and then the most representative feature, the one most strongly related to the target classes, is selected from each cluster. Features in different clusters are relatively independent. FAST is evaluated on publicly available high-dimensional image, microarray, and text data sets. Traditionally, feature subset selection research has focused on searching for relevant features; the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features.

International Journal of Engineering Research and General Science, Volume 2, Issue 6, October-November 2014, ISSN 2091-2730, www.ijergs.org
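The three steps of Zahn's algorithm can be sketched end to end for feature clustering, keeping the most target-correlated feature of each resulting cluster as its representative, as the FAST description suggests. This is an illustration under stated assumptions, not the authors' FAST implementation: the distance 1 − |Pearson correlation|, the fixed inconsistency threshold, and the toy data are all assumptions made here.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mst_feature_clusters(features, target, threshold=0.3):
    """Zahn-style clustering over features: build the MST of the complete
    feature graph (Kruskal), drop 'inconsistent' edges longer than the
    threshold, and return one representative feature per component."""
    names = list(features)
    # distance = 1 - |correlation|: redundant features sit close together
    edges = sorted(
        (1 - abs(pearson(features[a], features[b])), a, b)
        for a, b in combinations(names, 2)
    )
    parent = {n: n for n in names}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, a, b in edges:                      # step 1: Kruskal's MST
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            mst.append((w, a, b))
    parent = {n: n for n in names}
    for w, a, b in mst:                        # steps 2-3: cut long edges
        if w <= threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    # keep the feature most strongly related to the target in each cluster
    return sorted(
        max(c, key=lambda f: abs(pearson(features[f], target)))
        for c in clusters.values()
    )

# toy data: f1 and f2 are redundant copies of each other; f3 is independent
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],
    "f3": [3, 1, 4, 1, 5],
}
target = [1, 2, 3, 4, 5]
print(mst_feature_clusters(features, target))  # one feature per cluster
```

FAST itself measures feature-feature and feature-class relations with an information-theoretic criterion rather than Pearson correlation, but the structure — MST, edge removal, one representative per cluster — is the same.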


Similar resources

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...


Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...


A Pragmatic Application of Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become common. Thus, mining high dimensional data is an urgent problem of great practical importance. For high dimensional data, dimensionality reduction is a vital factor; to this end, a clustering-based feature subset selection algorithm is proposed in this particular paper. The chara...


Algorithm For Identifying Relevant Features Using Fast Clustering

In high dimensional data sets, feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features. A fast algorithm may be evaluated both on efficiency, the time required to find a subset of features, and on effectiveness, the quality of the selected subset. Fast clustering based featur...


A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum

Introduction: A major problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reactions within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting proteomic patterns from biological samples. A major challenge in extracting such ...


A Fast Clustering-based Feature Subset Selection Algorithm

The paper proposes a fast clustering algorithm for eliminating irrelevant and redundant data. Feature selection is applied to reduce the number of features in many applications where data has hundreds or thousands of features. Existing feature selection methods mainly focus on finding relevant features. In this paper, we show that feature relevance alone is insufficient for efficient...





Publication date: 2014